NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Ripple: Asynchronous Programming for Spatial Dataflow Architectures

https://doi.org/10.1145/3729256

Ghosh, Souradip; Shi, Yufei; Lucia, Brandon; Beckmann, Nathan (June 2025, Proceedings of the ACM on Programming Languages)

Spatial dataflow architectures (SDAs) are a promising and versatile accelerator platform. They are software-programmable and achieve near-ASIC performance and energy efficiency, beating CPUs by orders of magnitude. Unfortunately, many SDAs struggle to efficiently implement irregular computations because they suffer from an abstraction inversion: they fail to capture coarse-grain dataflow semantics in the application — namely asynchronous communication, pipelining, and queueing — that are naturally supported by the dataflow execution model and existing SDA hardware. Ripple is a language and architecture that corrects the abstraction inversion by preserving dataflow semantics down the stack. Ripple provides asynchronous iterators, shared-memory atomics, and a familiar task-parallel interface to concisely express the asynchronous pipeline parallelism enabled by an SDA. Ripple efficiently implements deadlock-free, asynchronous task communication by exposing hardware token queues in its ISA. Across nine important workloads, compared to a recent ordered-dataflow SDA, Ripple shrinks programs by 1.9×, improves performance by 3×, increases IPC by 58%, and reduces dynamic instructions by 44%.
more » « less
Full Text Available
Leviathan: A Unified System for General-Purpose Near-Data Computing

https://doi.org/10.1109/MICRO61859.2024.00095

Schwedock, Brian C; Beckmann, Nathan (November 2024, IEEE)

The rising cost of data movement poses a significant challenge to future computing systems. The call to arms for novel data-centric systems has spawned a wave of near-data computing (NDC) architectures that move compute closer to data. Despite large benefits promised by NDC, prior designs suffer from limited applicability and difficult programming. This paper identifies the commonalities and differences across NDC designs to develop Leviathan, a unified architecture and programming interface for near-cache NDC. We build a taxonomy of NDC and identify the key dimensions as what, where, and when to compute. Leviathan provides a simple reactive-programming interface and automatically executes actions near data at the right time and place. The ability to integrate multiple NDC paradigms makes Leviathan the only general-purpose system to support a variety of specialized NDC designs. Across a range of NDC-specialized applications, Leviathan improves performance by 1.5×±3.7× and reduces energy by 22%±77% vs. a baseline multicore, while adding only ≈6% area compared to the last-level cache.
more » « less
Full Text Available
Pipestitch: An energy-minimal dataflow architecture with lightweight threads

https://doi.org/10.1145/3613424.3614283

Serafin, Nathan; Ghosh, Souradip; Desai, Harsh; Beckmann, Nathan; Lucia, Brandon (October 2023, Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '23))

Computing at the extreme edge allows systems with high-resolution sensors to be pushed well outside the reach of traditional communication and power delivery, requiring high-performance, high-energy-efficiency architectures to run complex ML, DSP, image processing, etc. Recent work has demonstrated the suitability of CGRAs for energy-minimal computation, but has focused strictly on energy optimization, neglecting performance. Pipestitch is an energy-minimal CGRA architecture that adds lightweight hardware threads to ordered dataflow, exploiting abundant, untapped parallelism in the complex workloads needed to meet the demands of emerging sensing applications. Pipestitch introduces a programming model, control-flow operator, and synchronization network to allow lightweight hardware threads to pipeline on the CGRA fabric. Across 5 important sparse workloads, Pipestitch achieves a 3.49 × increase in performance over RipTide, the state-of-the-art, at a cost of a 1.10 × increase in area and a 1.05 × increase in energy.
more » « less
Baleen: ML Admission & Prefetching for Flash Caches

Wong, Daniel Lin-Kit; Wu, Hao; Molder, Carson; Gunasekar, Sathya; Lu, Sathya; Khandkar, Snehal; Sharma, Abhinav; Berger, Daniel S; Beckmann, Nathan; Ganger, Gregory R (February 2024, Usenix)

Flash caches are used to reduce peak backend load for throughput-constrained data center services, reducing the total number of backend servers required. Bulk storage systems are a large-scale example, backed by high-capacity but low-throughput hard disks, and using flash caches to provide a more cost-effective storage layer underlying everything from blobstores to data warehouses. However, flash caches must address the limited write endurance of flash by limiting the long-term average flash write rate to avoid premature wearout. To do so, most flash caches must use admission policies to filter cache insertions and maximize the workload-reduction value of each flash write. The Baleen flash cache uses coordinated ML admission and prefetching to reduce peak backend load. After learning painful lessons with our early ML policy attempts, we exploit a new cache residency model (which we call episodes) to guide model training. We focus on optimizing for an end-to-end system metric (Disk-head Time) that measures backend load more accurately than IO miss rate or byte miss rate. Evaluation using Meta traces from seven storage clusters shows that Baleen reduces Peak Disk-head Time (and hence the number of backend hard disks required) by 12% over state-of-the-art policies for a fixed flash write rate constraint. Baleen-TCO, which chooses an optimal flash write rate, reduces our estimated total cost of ownership (TCO) by 17%. Code and traces are available at https://www.pdl.cmu.edu/CILES/.
more » « less
Full Text Available
MANIC: A $$19\mu\mathrm{W}$$ @ 4MHz, 256 MOPS/mW, RISC-V microcontroller with embedded MRAM main memory and vector-dataflow co-processor in 22nm bulk finFET CMOS

https://doi.org/10.1109/ISCAS46773.2023.10181809

Gobieski, Graham; Atli, Oguz; Erbagci, Cagri; Mai, Ken; Beckmann, Nathan; Lucia, Brandon (May 2023, 2023 IEEE International Symposium on Circuits and Systems (ISCAS))

Whether powered by a battery or energy harvested from the environment, low-power (LP) sensor devices require extreme energy efficiency. These sorts of devices are becoming pervasive, running increasingly sophisticated applications in inhospitable environments. We present Manic, an energy-efficient microcontroller (MCU) augmented with a vector-dataflow (VDF) co-processor. The testchip taped out on a 22nm bulk finFET CMOS process demonstrates that Manic is 60% more energy-efficient than a baseline, scalar, low-power MCU, achieving peak efficiency of 256 MOPS/mW (2.6× prior work) while consuming only 19.1μW (@4MHz). To make the system viable for intermittently powered applications that require non-volatile storage, Manic includes a 256KB embedded MRAM.
more » « less
Kobold: Simplified Cache Coherence for Cache-Attached Accelerators

https://doi.org/10.1109/LCA.2023.3269399

Brana, Jennifer; Schwedock, Brian C.; Manerkar, Yatin A.; Beckmann, Nathan (January 2023, IEEE Computer Architecture Letters)

Full Text Available
Brief Announcement: Spatial Locality and Granularity Change in Caching

https://doi.org/10.1145/3490148.3538559

Beckmann, Nathan; Gibbons, Phillip B.; McGuffey, Charles (July 2022, SPAA '22: 34th ACM Symposium on Parallelism in Algorithms and Architectures)

Full Text Available
täkō: a polymorphic cache hierarchy for general-purpose optimization of data movement

https://doi.org/10.1145/3470496.3527379

Schwedock, Brian C.; Yoovidhya, Piratach; Seibert, Jennifer; Beckmann, Nathan (June 2022, International Symposium on Computer Architecture)

Current systems hide data movement from software behind the load-store interface. Software’s inability to observe and respond to data movement is the root cause of many inefficiencies, including the growing fraction of execution time and energy devoted to data movement itself. Recent specialized memory-hierarchy designs prove that large data-movement savings are possible. However, these designs require custom hardware, raising a large barrier to their practical adoption. This paper argues that the hardware-software interface is the problem, and custom hardware is often unnecessary with an expanded interface. The täkō architecture lets software observe data movement and interpose when desired. Specifically, caches in täkō can trigger software callbacks in response to misses, evictions, and writebacks. Callbacks run on reconfigurable dataflow engines placed near caches. Five case studies show that this interface covers a wide range of data-movement features and optimizations. Microarchitecturally, täkō is similar to recent near-data computing designs, adding ≈5% area to a baseline multicore. täkō improves performance by 1.4×–4.2×, similar to prior custom hardware designs, and comes within 1.8% of an idealized implementation.
more » « less
Full Text Available
RipTide: A Programmable, Energy-Minimal Dataflow Compiler and Architecture

https://doi.org/10.1109/MICRO56248.2022.00046

Gobieski, Graham; Ghosh, Souradip; Heule, Marijn; Mowry, Todd; Nowatzki, Tony; Beckmann, Nathan; Lucia, Brandon (October 2022, 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO))

Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has proven coarse-grained reconfigurable arrays (CGRAs) as a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core.RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can re-use existing network switches. RipTide’s compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while saving 25% energy v. the state-of-the-art energy-minimal CGRA and 6.6 × energy v. a von Neumann core.
more » « less
Full Text Available
Brief Announcement: Block-Granularity-Aware Caching

https://doi.org/10.1145/3409964.3461818

Beckmann, Nathan; Gibbons, Phillip B.; McGuffey, Charles (July 2021, SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures)

Full Text Available

« Prev Next »

Search for: All records